Abstract

Flow scheduling is one of the oldest and most stubborn problems in networking. It becomes more
crucial in next-generation networks, where link states change quickly and exploring the global
structure is very costly. In such situations, distributed algorithms often dominate. In this paper, we design a
distributed virtual game to solve the flow scheduling problem and then generalize it to unknown environments,
where online learning schemes are used. In the virtual game, we use incentives to stimulate selfish users
to reach a Nash Equilibrium Point, whose quality is justified by an analysis of the 'Price of Anarchy'. In the
unknown-environment generalization, our ultimate goal is to minimize cost in the long run. To balance
exploration of routing cost against exploitation based on limited information, we model
the problem as a Multi-armed Bandit and combine the newly proposed DSEE with the virtual game
design. Armed with these tools, we obtain a fully distributed algorithm that ensures regret grows
logarithmically with time, which is optimal in the classic Multi-armed Bandit Problem. Theoretical proof and
simulation results both affirm this claim. To our knowledge, this is the first work to combine multi-armed
bandits with distributed flow scheduling.
I. INTRODUCTION



We consider a network sharing optimization problem. Every user would like to optimize its own
path selection without exchanging information with others. However, congestion on a shared edge
increases cost. We would like to find a distributed scheme that lets the users reach the best solution.

We assume here that each user has a flow with unit capacity requirement but a different source or destination.
Generalization to the multi-commodity situation is not difficult if we split flows into units and carry out the
algorithm for each unit flow. The cost on each edge is a random variable due to link state changes and environment
variations. As mentioned above, conflicts increase costs, so we assume the expectation of such a variable
grows as the flow routed on the edge increases. In the first half of this paper, we assume these expectations are
known and focus on designing the virtual game that finds the flow scheduling scheme.

In the second half, we generalize the problem to an unknown environment. That is, we do not know the
expectations of the edge costs and need moderate exploration. We use the newly proposed DSEE Sequence [17]
to optimize the time spent on exploration. After exploration, samples of edge costs are stored in routers and the
sample means are calculated to approximate the expectations. Exploration periods recur in a
predetermined manner, so routers know when to explore. Between two neighboring exploration periods is an
exploitation period. At the beginning of an exploitation period, we use the distributed Bellman-Ford algorithm [16]
to calculate routing tables based on the sample means. To resolve conflicts, we apply the
virtual game here. During the rest of the exploitation period, we route flows according to the routing tables.
Exploration and Bellman Ford periods both introduce extra cost, or reward loss. Our ultimate objective
is to design a distributed algorithm that minimizes the long-run total cost for the whole network. Throughout the
paper, we assume that time is slotted and that both exploration and exploitation take time.

A. Background of Flow Scheduling 

Flow scheduling problems can be very hard to solve even in the known scenario. The literature
in this area keeps growing with the development of the widely used MPLS network. Here, we base our work on
flow scheduling rather than packet switching, wired or wireless, to make it more practical and useful
today.

Minimum interference routing [1]-[4] is a promising direction in flow scheduling. Its purpose is
quite similar to ours. However, minimum interference routing algorithms, like MIRA [2] and WSP [4], focus
on load balancing to keep the network able to admit future flows, while we want to solve an
optimization problem for the current flows. Extensions of our work allow adaptive scheduling of newly admitted
flows, but all routers should be informed beforehand that new flows have come in.

The literature on Routing Games is more relevant to our problem. Firstly, our model is very similar to
the model of atomic routing in [5]. Secondly, in the Bellman Ford period users play a virtual game
and take turns to select their own optimized routing path without considering the congestion caused to others,
which is the same as in routing games. However, there are still fundamental differences between our virtual game
and atomic routing. Firstly, we let distributed routers decide the best paths for the players, rather than letting
players select paths themselves. This is more reasonable since in real life routers decide paths for users.
Secondly, our game is only virtual and is ultimately used to solve an optimization problem. However, it is well
known that games do not always converge to the optimum point. So we set the extra cost that one user introduces to
the whole network as the revenue he pays (see part II.B), which turns this non-cooperative game into one where
selfish optimization equals social optimization. We prove fast convergence to a Nash Equilibrium Point in this
routing game and use the constant bound of the 'Price of Anarchy' to measure its quality [9]. Moreover, the model
of [5] does not consider the generalization to an unknown environment, so our work is more general.

B. Stochastic online learning based on MAB Problem 

The second half of our paper focuses on the generalization to the unknown model. The nature of a routing problem
with unknown edge costs calls for the Multi-armed Bandit (MAB) Problem. In the classic MAB, there
are N independent arms and a single player. Each arm, when played, incurs a random cost with an unknown
distribution. The player must decide the sequence in which to play the arms so as to minimize cost. The player
has to balance exploration and exploitation, which respectively mean playing a new arm to learn its cost
distribution and playing the arm that currently appears to have minimum cost. A frequently used
criterion for judging the performance of a chosen sequence is the so-called regret, or cost of learning, defined as
the difference in total cost between the chosen sequence and the optimum sequence when the cost distribution is
known. The best regret, growing logarithmically with time, was obtained in [10] by Lai and Robbins. In [11][12],
the authors gave index-type policies that achieve logarithmic regret.

Routing problems with unknown edge cost distributions can be modeled as a variation of the classic MAB
problem if we view each path as an arm. However, the performance of classic algorithms degrades severely here,
since paths with shared edges cannot be viewed as independent. In [13], Liu and Zhao exploit the dependence
of paths to obtain a logarithmic regret. In [14], Gai and Krishnamachari modified UCB1 [12] and
applied their algorithm LLC to the shortest path problem. However, none of them gave a distributed method
for path selection. In our work, we put this difficulty into the design of a distributed virtual game and solve
it beforehand in the known model. It is important to note that the concept of Distributed Learning in [15] is
different from our concept of 'distributed'. 'Distributed' in [15] means that each user does not exchange
information with others and finds the best arm on his own. We further require that our algorithm be carried out
in a distributed way in each router by using the Bellman Ford Algorithm. Moreover, [12]-[15] did not consider
network sharing, so our work is more general.

In our paper, we develop an algorithm that performs online learning for the multi-user situation in a distributed
way. To our knowledge, no previous work considered such a comprehensive situation. Based on our algorithm, the whole
network can also achieve regret that is logarithmic in time. However, in order to evaluate the virtual game at the
same time, we define regret slightly differently from the classic definition.

Definition 1: We define Regret as the number of time slots when the network is not in a Nash Equilibrium Point. 

In the Regret Analysis part, we analyze the equivalence between Definition 1 and the classic one. We prove that our
virtual game reaches a Nash Equilibrium Point in a finite number of circles, and that regret grows logarithmically
with time. These claims ensure the effectiveness of the virtual game.

It is important to note that the Optimum Point is also a Nash Equilibrium Point in our game. However, the Nash
Equilibrium Point is not unique since the strategy domain of each user is discrete (different paths). Commonly,
the Nash Equilibrium Point is unique only when the strategy domain is continuous [5][6]. So an analysis of the
Price of Anarchy is necessary.

II. SYSTEM MODEL 

A. Cost Modeling 

Consider a graph $G = (V, E)$ and $K$ source-destination pairs $(s_k, d_k)$, each with unit amount $f_k = 1$. For
each edge $e \in E$, define the flow on the edge

$$f_e = \sum_{k:\, e \in p_k} f_k \tag{1}$$

in which $p_k$ represents the path chosen by the $k$th flow. Since all flows have unit amount, the flow on each
edge $f_e$ takes a discrete value from $\{1, 2, \ldots, K\}$. Define

$$C(F) = \sum_{e \in E} c_e(f_e) \tag{2}$$

as the total cost in one time slot, in which $c_e$ represents the cost for edge $e$. At each time slot, for each
edge $e$ and a certain flow amount $f_e$, $c_e(f_e)$ is a random variable whose expectation increases when $f_e$
grows. Across time slots, $c_e(f_e)$ is an i.i.d. random process. $F$ denotes the whole flow distribution on the
network. In order to minimize the time average of $C(F)$, we try to obtain, in a distributed way, the flow
distribution $F$ that minimizes the expectation of $C(F)$. Henceforth we use a bar to represent the expectation. For
example, $\bar{C}(F)$ denotes the expectation of $C(F)$. The unit amount is the granularity of all flows.
Generalization to the multi-commodity scenario is straightforward if we split flows into flow units and treat each
unit as an independent flow.
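
As a concrete illustration of (1) and (2), the following Python sketch counts the unit flows on each edge and sums
a per-edge expected cost. The quadratic cost function, the node names, the use of directed edges and the convention
that an unused edge costs nothing are assumptions made only for this example; they are not part of the model above.

```python
from collections import Counter


def edge_flows(paths):
    """f_e of (1): number of unit flows whose path p_k contains edge e."""
    flows = Counter()
    for path in paths:                       # one node sequence per flow
        for edge in zip(path, path[1:]):     # consecutive nodes form a directed edge
            flows[edge] += 1                 # every flow has unit amount f_k = 1
    return flows


def expected_total_cost(paths, mean_cost):
    """Expectation of (2), summed over the edges that carry at least one flow."""
    return sum(mean_cost(e, f) for e, f in edge_flows(paths).items())


def quadratic(edge, f):
    """Hypothetical expected edge cost, increasing in the flow f."""
    return f ** 2


paths = [["s", "a", "t"], ["s", "a", "b"]]   # two flows sharing edge (s, a)
flows = edge_flows(paths)
total = expected_total_cost(paths, quadratic)
```

In this toy instance the shared edge carries two units, so it contributes 4 to the expected total cost while each
private edge contributes 1.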

B. Incentive 

In the virtual game design, users are assumed to be selfish since they cannot exchange information. In order to
stimulate users to cooperate, we set revenues as incentives for them. Assume that at some time $t$ there are already
$K_t$ flows in the network and the whole flow distribution is currently $F_t$. Then the expected total cost of the
network equals $\bar{C}(F_t)$. For a certain $k$th flow, let $F_t(k)$ denote the flow distribution when the $k$th
flow is withdrawn from $F_t$. Then we define

$$\bar{C}(F_t) - \bar{C}(F_t(k)) \tag{3}$$

as the revenue for the $k$th flow. When a user has the opportunity to change its routing path, it chooses the
path that introduces the minimum extra cost to the whole network, because this extra cost is exactly what it pays.
The total cost therefore decreases whenever a user changes its path.
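
The revenue in (3) can be computed directly from its definition. The sketch below is only an illustration under
assumed inputs: the quadratic expected cost, the node names and the alternative path through a node "c" are
hypothetical, and an edge carrying no flow is assumed to cost nothing.

```python
from collections import Counter


def edge_flows(paths):
    """Number of unit flows on each directed edge."""
    return Counter(e for p in paths for e in zip(p, p[1:]))


def expected_total_cost(paths, mean_cost):
    return sum(mean_cost(e, f) for e, f in edge_flows(paths).items())


def revenue(paths, k, mean_cost):
    """Revenue (3): expected cost with flow k minus expected cost without it."""
    without_k = paths[:k] + paths[k + 1:]
    return expected_total_cost(paths, mean_cost) - expected_total_cost(without_k, mean_cost)


def quadratic(edge, f):
    """Hypothetical expected edge cost, increasing in the flow f."""
    return f ** 2


paths = [["s", "a", "t"], ["s", "a", "b"]]
r0 = revenue(paths, 0, quadratic)                       # (4 + 1 + 1) - (1 + 1)

# If flow 0 moves to a disjoint (hypothetical) path, it pays less.
rerouted = [["s", "c", "t"], ["s", "a", "b"]]
r0_alt = revenue(rerouted, 0, quadratic)                # (1 + 1 + 1 + 1) - (1 + 1)
```

Because the other flows stay fixed while flow 0 compares its options, the drop in its revenue from 4 to 2 equals
the drop in the expected total cost from 6 to 4.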
III. ALGORITHM IN KNOWN MODEL

A. Virtual Game Design 

In this part, we assume that routers know all $\bar{c}_e(f_e)$ beforehand. Users take turns to hire routers to run
the Bellman Ford Algorithm. The price for each edge is set as the incentive described in II.B.

There are $N \cdot K$ time slots reserved for one circle, where $N = |V|$ and one circle lets every user optimize
its path once. So the time reserved for each user is $N$ slots, and the Bellman Ford Algorithm surely converges
within such a period. Also, the total cost decreases each time a user changes its path, since the revenue for this
user defined earlier is equal to the extra cost it introduces to the whole network.

The complete algorithm is as follows: 

1) Take the $k$th flow out of the current flow distribution. If this is the first time this flow does path
optimization and routers do not yet know the path to transmit it, there is nothing to take out.

2) Calculate the price on each edge. The price is the extra cost if this edge is chosen:

$$P_e(F) = \bar{c}_e(f_e) - \bar{c}_e(f_e - f_k) \tag{4}$$

3) Start the Bellman Ford Algorithm and wait for $N$ slots to ensure its convergence. The source node is $s_k$.
Find the path with minimum price to transmit the flow to $d_k$.

4) Add $f_k$ to the flow on each edge chosen to transmit the $k$th flow.

5) Repeat from 1) for the $(k+1)$th flow.
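
The five steps above can be sketched as follows. This is a minimal, centralized illustration rather than the
distributed implementation: a plain Bellman Ford routine stands in for the distributed one, the four-node network
and the quadratic expected cost are hypothetical, and the rule that a flow only switches when the new path is
strictly cheaper is an assumption added here so that ties do not cause endless switching.

```python
def bellman_ford(nodes, edges, price, source):
    """Centralized stand-in for the distributed Bellman Ford Algorithm."""
    dist = {v: float("inf") for v in nodes}
    pred = {v: None for v in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):              # N - 1 relaxation rounds suffice
        for u, v in edges:
            if dist[u] + price[(u, v)] < dist[v]:
                dist[v] = dist[u] + price[(u, v)]
                pred[v] = u
    return pred


def cheapest_path(nodes, edges, price, source, dest):
    """Walk the predecessor map back from dest; assumes dest is reachable."""
    pred = bellman_ford(nodes, edges, price, source)
    path = [dest]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1]


def one_circle(nodes, edges, pairs, paths, mean_cost):
    """Run steps 1) to 5) once for every flow; returns True if any path changed."""
    flow = {e: 0 for e in edges}
    for p in paths:
        if p is not None:
            for e in zip(p, p[1:]):
                flow[e] += 1
    changed = False
    for k, (s, d) in enumerate(pairs):
        old = paths[k]
        if old is not None:                      # step 1: withdraw flow k
            for e in zip(old, old[1:]):
                flow[e] -= 1
        # step 2: price (4) = extra expected cost of one more unit on the edge
        price = {e: mean_cost(e, flow[e] + 1) - mean_cost(e, flow[e]) for e in edges}
        new = cheapest_path(nodes, edges, price, s, d)        # step 3

        def cost_of(p):
            return sum(price[e] for e in zip(p, p[1:]))

        if old is None or cost_of(new) < cost_of(old):       # switch only on strict gain
            paths[k], changed = new, True
        for e in zip(paths[k], paths[k][1:]):    # step 4: add flow k back
            flow[e] += 1
    return changed


nodes = ["s", "a", "b", "t"]
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
pairs = [("s", "t"), ("s", "t")]                 # two unit flows from s to t
paths = [None, None]                             # nothing routed yet
circles = 1
while one_circle(nodes, edges, pairs, paths, lambda e, f: f ** 2):
    circles += 1
```

In this example the first circle routes the two flows onto disjoint paths and the second circle changes nothing,
which is the stopping condition used in the convergence argument of III.B.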

B. Nash Equilibrium Point 

Theorem 1: If we run the algorithm described in III.A, then after a finite number of circles the whole network
reaches a Nash Equilibrium Point. The convergence time is bounded.

Proof: During one circle, one of the two events below must occur:

a) At least one user changes his path.

b) No one changes his path.

If event 'b' happens, no one can improve by changing his path unilaterally, so the network has
reached a Nash Equilibrium Point.

If event 'a' happens, the total cost decreases, as stated in II.B. Since each flow has a finite number of
paths to take, the number of flow distributions is finite, too. So 'a' cannot happen forever.

We can further derive an upper bound on the convergence time to reach a Nash Equilibrium Point. In fact,
we need at most $\lceil S_M / S_m \rceil$ Bellman Ford circles. Here $S_M$ denotes the maximum difference between
the costs of two different flow distributions, and $S_m$ denotes the minimum. This is true because during each
Bellman Ford circle, the cost of the whole network decreases by at least $S_m$ if 'b' does not happen. ∎

IV. PRICE OF ANARCHY 

In this part we analyze the 'Price of Anarchy'. This notion was originally defined in [8] to
measure the performance of selfish behavior in a simple game of N players that compete for M parallel links. In [9],
the authors analyzed the price of anarchy of an atomic routing game with polynomial edge costs with nonnegative
coefficients. They obtained a result of $d^{O(d)}$, in which $d$ represents the highest order of the polynomial edge
cost function. This result is considered by [5] to be a significant generalization of previous work.

In our paper, we still need an analysis of the 'Price of Anarchy' since our ultimate goal is to solve an
optimization problem. So far, we have given an algorithm that lets users optimize their own price (the incentive)
to reach a Nash Equilibrium Point. So we need to bound the gap between a Nash Equilibrium Point and the
optimum point.

Definition 2: We define the Price of Anarchy as

$$\bar{C}(F_N) / \bar{C}(F^*) \tag{5}$$

Here $F_N$ represents the flow distribution of one Nash Equilibrium Point, and $F^*$ represents the flow distribution
of the optimum point, at which the expectation of (2) is minimized.
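
For very small instances, the ratio in (5) can be checked by brute force. The sketch below enumerates every
combination of paths, keeps those from which no single user can lower the total expected cost, and compares the
worst of them with the optimum. Under the revenue (3) a user's own price drops exactly when the total cost drops,
which is why the Nash test compares total costs. The two-user instance, the edge labels and the quadratic cost are
hypothetical, and reporting the worst Nash point is a choice made for this example.

```python
from collections import Counter
from itertools import product


def expected_total_cost(profile, mean_cost):
    """Expected total cost of a profile; each path is a tuple of edge labels."""
    flows = Counter(e for path in profile for e in path)
    return sum(mean_cost(e, f) for e, f in flows.items())


def is_nash(profile, options, mean_cost):
    """True if no single user can lower the total cost by switching paths alone."""
    base = expected_total_cost(profile, mean_cost)
    for k, alternatives in enumerate(options):
        for alt in alternatives:
            trial = profile[:k] + (alt,) + profile[k + 1:]
            if expected_total_cost(trial, mean_cost) < base:
                return False
    return True


def worst_price_of_anarchy(options, mean_cost):
    """Ratio (5) for the most expensive Nash point found by enumeration."""
    profiles = list(product(*options))
    optimum = min(expected_total_cost(p, mean_cost) for p in profiles)
    nash_costs = [expected_total_cost(p, mean_cost)
                  for p in profiles if is_nash(p, options, mean_cost)]
    return max(nash_costs) / optimum


options = [
    (("y", "z1"), ("x",)),   # paths available to user 1
    (("x", "z2"), ("y",)),   # paths available to user 2
]
poa = worst_price_of_anarchy(options, lambda e, f: f ** 2)
```

Here the profile in which both users take their two-edge paths costs 4 and cannot be improved by a single user,
while the optimum costs 2, so the ratio is 2. This illustrates why a Nash Equilibrium Point need not be optimal
when the strategy domain is discrete.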

We first show the existence of a constant price of anarchy for general polynomial edge costs. Then we give a concrete
value for polynomials with nonnegative coefficients. Here we need the functions to be convex, but this is a mild
requirement where congestion is concerned. The assumption of polynomial edge costs is common in previous work on
Routing Games [5][8][9]. Our model is different from a routing game. In Routing Games, the 'total price' in (6) is
equal to the expectation of the total cost function defined in (2), while they are different in our virtual game.
However, polynomial functions are sufficient to model congestion in our problem, so we still use this assumption.

A. General Polynomial Function: Existence 

In this part we prove the existence of a constant upper bound on the price of anarchy for polynomial edge cost
functions. In other words, this constant is independent of network size and topology. In the proof we use the
following definition.

Definition 3: We define the Total Price for distribution $F$ as

$$P(F) = \sum_{e \in E} [\bar{c}_e(f_e) - \bar{c}_e(f_e - f_u)] \cdot f_e \tag{6}$$

where $f_u$ is the unit flow amount.

We simply replace $f_u$ with 1 in the following parts, since section II.A states that all flows have the same
unit amount. What is important is the reason we define (6) as the 'total price'. In fact, from (3) and (4) we know
that the incentive pricing scheme asks the $k$th user for a price of

$$P_k(F) = \sum_{e \in p_k} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \tag{7}$$

We add up (7) for all users and get

$$\sum_{k=1}^{K} P_k(F) = \sum_{k=1}^{K} \sum_{e \in p_k} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \tag{8}$$

Changing the order of summation gives (6).
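
Written out, the change of summation order is the following short derivation. It only uses (1) with unit flows,
which means that the number of users whose path contains edge $e$ equals $f_e$; edges that carry no flow contribute
nothing to either side.

```latex
\begin{align*}
\sum_{k=1}^{K} P_k(F)
  &= \sum_{k=1}^{K} \sum_{e \in p_k} \bigl[\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)\bigr] \\
  &= \sum_{e \in E} \bigl[\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)\bigr] \cdot \bigl|\{k : e \in p_k\}\bigr| \\
  &= \sum_{e \in E} \bigl[\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)\bigr] \cdot f_e \;=\; P(F).
\end{align*}
```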

Theorem 2: If the expectation of the edge cost function $\bar{c}_e(f_e)$ is convex and grows polynomially with
$f_e$, there exists a constant bound for the 'Price of Anarchy' independent of network size and flow amount.

We assume that edge cost functions are polynomials of maximum degree $d$. Here $d$ is different from the
degree of the barycentric spanner in the proof of Theorem 4.

$$\bar{c}_e(f_e) = a_e f_e^d + \sum_{i=1}^{d} b_e^{(i)} f_e^{d-i} \tag{9}$$

First we give Lemma 1, which relates the total cost to the total price.

Lemma 1: For a given network $G = (V, E)$, there exist two constants $A_l$, $A_r$ such that for any flow
distribution $F$ we have

$$A_l < \frac{P(F)}{\bar{C}(F)} < A_r \tag{10}$$

These two numbers are independent of the network size.

The idea behind Lemma 1 is simple. For a polynomial $\bar{c}_e$, the numerator and the denominator of (10)
are of the same order in the flow amount $f_e$, so the fraction is bounded. The detailed proof is in Appendix
A. Similarly, we arrive at the following formula.

For a given $G = (V, E)$, there exists a constant $A_u$ such that for any flow distribution and any edge $e$,

$$\frac{\bar{c}_e(f_e + 1) - \bar{c}_e(f_e)}{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)} < A_u \tag{11}$$

Then we give Lemma 2. This is the 'Variational Inequality Characterization' [5], which describes the
basic feature of a Nash Equilibrium Point. The proof of Lemma 2 is also in the appendix.

Lemma 2: For a given network $G = (V, E)$ and a Nash Equilibrium Point $F$ of $K$ users, for any flow distribution
$F'$ we have

$$\sum_{e \in E} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \cdot f_e \le A_u \sum_{e \in E} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \cdot f'_e \tag{12}$$

Based on these two Lemmas, we can complete the proof of Theorem 2. The proof is simple in
nature. We have shown that the total cost (2) and the total price (6) grow with the flow amount in the same order
(Lemma 1). Then we find a constant upper bound on $P(F)/P(F^*)$ using Lemma 2. These two steps complete the proof.
The detailed proof is in Appendix C.



B. Polynomial Function with Nonnegative Coefficients: Concrete Value 

For polynomial edge costs with nonnegative coefficients, we give a concrete value of the upper bound.
Although we could derive a proof by the same procedure as in part IV.A, we can take advantage of the
nonnegative coefficients to get a relatively simple proof in the Appendix. First we give some definitions. If
(9) holds and the coefficients are all nonnegative, we have for each edge $e$

$$\bar{c}_e(f_e + 1) - \bar{c}_e(f_e) = a_e [(f_e + 1)^d - f_e^d] + \sum_{i=1}^{d} b_e^{(i)} [(f_e + 1)^{d-i} - f_e^{d-i}] \tag{13}$$

Obviously, all terms in (13) have nonnegative coefficients. We assume 

d 

c e (/e + l)-c e (/e) = ^aW/f 1 (14) 

i=0 

in which c$ > and a e °^ = a e . Moreover, 

$$\sum_{i=0}^{d} a_e^{(i)} = \bar{c}_e(1) - \bar{c}_e(0) \tag{15}$$

We assume

$$s_e = \min_{a_e^{(i)} > 0} \left( a_e^{(i)} \right) \tag{16}$$

$$L = \max_{e \in E} [\bar{c}_e(1) - \bar{c}_e(0)] \tag{17}$$

Theorem 3: For a given network $G = (V, E)$, if all edge cost functions satisfy (9) and the coefficients are
nonnegative, the constant upper bound of the Price of Anarchy has the concrete value

$$\left[ (d + 1) L \max_{e \in E} \frac{1}{s_e} \right]^d = d^{O(d)}$$

V. ALGORITHM IN UNKNOWN MODEL 

From this section on, we generalize to the unknown model. In other words, we further assume that
the cost distribution of each edge is unknown at the beginning. In order to get enough information about
the network, we adopt the newly proposed DSEE Sequence algorithm of [17] and cut time into interleaving
exploration and exploitation periods. A router sends exploration flows to get samples of the cost and stores them
in memory. Based on these samples, a router calculates the sample mean and treats it as the expectation of the edge
cost when running the Bellman-Ford Algorithm. Between the exploration periods are the exploitation periods, at the
beginning of which the virtual game is applied. During the rest of an exploitation period, users share the network
based on the routing tables. In order not to route flows on edges with high price, each user agrees to do enough
exploration. However, exploration periods and Bellman Ford periods cannot be too long since they introduce
extra cost to the network.

A. Exploration 

One exploration period lasts for $N = |V|$ time slots. In one exploration period, only one source node $s_k$
starts exploration. The $K$ source nodes take turns to explore in different exploration periods. At the beginning
of the first exploration period, $s_1$ sends out a short flow of a random amount $k_1$ on a random edge $e_r$
incident to it to explore the value $c_{e_r}(k_1)$. Then the node at the other end of edge $e_r$ receives this flow
and forwards it in the next time slot. The whole exploration period terminates after $N = |V|$ time slots. In the
next exploration period, the source node $s_2$ starts exploration instead of $s_1$. The constant $N = |V|$ is large
enough to ensure a minimum probability $r = \min_{e \in E}(r_e) > 0$, in which $r_e$ is the probability of edge $e$
being estimated.

B. Exploitation 

At the beginning of this period, $N \cdot K$ time slots are reserved for a Bellman Ford period. During
one such period, we run one circle of the virtual game described in III.A.
However, we replace (4) with

$$P_e(F) = c_e(f_e) - c_e(f_e - f_k) \tag{18}$$

in which $c_e(f_e)$ now denotes the sample mean stored in the routers' memory.
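
A router-side store for these samples could look like the sketch below. It is only an illustration: the class name
`EdgeSampleStore`, keying the samples by edge and flow amount, treating an idle edge as costing nothing, and raising
an error for a flow amount that has never been sampled are all assumptions of this example.

```python
from collections import defaultdict


class EdgeSampleStore:
    """Per-router memory of observed costs, keyed by (edge, flow amount)."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, edge, amount, observed_cost):
        """Store one exploration sample of the cost of `edge` under flow `amount`."""
        self._sum[(edge, amount)] += observed_cost
        self._count[(edge, amount)] += 1

    def mean(self, edge, amount):
        """Sample mean used in place of the unknown expectation."""
        if amount == 0:
            return 0.0                       # assumption: an idle edge costs nothing
        if self._count[(edge, amount)] == 0:
            raise KeyError(f"no sample yet for {edge} at flow {amount}")
        return self._sum[(edge, amount)] / self._count[(edge, amount)]

    def price(self, edge, f_e, f_k=1):
        """Price (18): sample-mean cost at f_e minus sample-mean cost at f_e - f_k."""
        return self.mean(edge, f_e) - self.mean(edge, f_e - f_k)


store = EdgeSampleStore()
store.record(("a", "b"), 1, 2.0)
store.record(("a", "b"), 1, 4.0)
store.record(("a", "b"), 2, 10.0)
p = store.price(("a", "b"), 2)               # 10.0 - 3.0
```

With these three samples, the estimated price of putting a second unit flow on edge (a, b) is 7.0.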

C. DSEE 

Time is divided into an interleaving sequence of exploration and exploitation periods. At the beginning of each
exploitation period, $N \cdot K$ time slots are arranged for a Bellman Ford period to play the virtual game. A
Bellman Ford period terminates only when the total time $N \cdot K$ is reached. Similarly, an exploration period
ends after $N$ time slots. The exploitation period, however, ends when the time slot $t$ satisfies

$$\mathrm{card}(t) < G \log(t) \tag{19}$$

in which $\mathrm{card}(t)$ represents the number of time slots used for exploration up to time $t$. The whole
DSEE Sequence is therefore determined beforehand once the parameter $G$ has been chosen.
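
Since the sequence is deterministic, every router can precompute it. The following sketch labels each time slot
under a few assumptions that the text does not fix: slots are numbered from 1, the sequence starts with an
exploration period, the natural logarithm is used, and condition (19) is tested against the index of the next slot.
The function name and the labels are hypothetical.

```python
import math


def dsee_schedule(horizon, n_nodes, n_flows, g):
    """Label slots 1..horizon as 'explore', 'bellman_ford' or 'route'."""
    labels = []
    explored = 0                                   # card(t): exploration slots so far
    while len(labels) < horizon:
        labels += ["explore"] * n_nodes            # one exploration period of N slots
        explored += n_nodes
        labels += ["bellman_ford"] * (n_nodes * n_flows)   # one circle of the virtual game
        # Keep routing until (19) holds, i.e. until card(t) < G log(t).
        while len(labels) < horizon and explored >= g * math.log(len(labels) + 1):
            labels.append("route")
    return labels[:horizon]


schedule = dsee_schedule(horizon=50, n_nodes=3, n_flows=2, g=2.0)
n_explore = schedule.count("explore")
```

Because exploration is triggered only when card(t) falls below G log(t), the number of exploration slots grows
logarithmically with the horizon, which is the property the regret analysis relies on.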
VI. REGRET ANALYSIS

We define regret as the number of time slots in which the flows are not routed at a Nash Equilibrium Point
(see Definition 1 at the end of the Introduction). In section III.B, we proved that the K users inevitably reach a
Nash Equilibrium Point within a finite number of circles of the virtual game. In this part, we analyze the
equivalence of Definition 1 with the classic one. Then we prove that regret grows logarithmically with time.

A. Equivalence between Definition 1 and classic definition 

Classic definition of regret is the difference in total cost between the chosen strategy sequence and the 
optimum strategy sequence when cost distribution is known. 

In our algorithm, regret increases under two conditions. The first is exploration or Bellman Ford.
During these periods, no flows are transmitted. However, if we assign an extra constant cost to each such
slot to obtain a classic definition, the two regrets grow with time in the same order. The second
is when flows are not routed at a Nash Equilibrium Point in an exploitation period. In one such slot, the
extra cost cannot be larger than $S_M$, the maximum cost difference between two flow distributions defined in the
proof of Theorem 1. Therefore, even if we define a classic regret, it still grows in the same order with time.

The only difference is the distance from a Nash Equilibrium Point to the Optimum Point. However, finding
the Optimum Point for different flows tends to be NP hard and cannot be done in a distributed way. So we
choose to define regret based on a sub-optimal Nash Equilibrium Point, which cannot be further improved in a
distributed manner. Previous parts have shown the constant 'Price of Anarchy' bound, which supports the
feasibility of our definition.

B. Regret Order 

Theorem 4: If the chosen G in (19) satisfies 

Q 72 I 7711 2 

G > max(3 /r, U_ ) (20) 

rc A 

then $\mathrm{regret}(T)$ grows as $O(\log T)$.

Here we define the quantities used in Theorem 4.

Definition 4: Let $S$ be a $d$-dimensional vector space. A set $B = \{x_1, x_2, \ldots, x_d\} \subset S$ is called
a barycentric spanner for $S$ if every $x$ in $S$ can be written as a linear combination of elements of $B$ with
coefficients in $[-1, 1]$.

It is shown in [15] that if $S$ is a compact set, then it has a barycentric spanner. We know that the set of
different paths for a certain source-destination pair $(s_k, d_k)$ is a compact vector space, thus it has a
barycentric spanner with dimension $d_k$. We assume $d = \max_{k = 1 \sim K} d_k$. $\sigma^2$ is the largest variance
of all the edge costs under different flow distributions, and $r$ is the minimum probability that a certain edge is
chosen during explorations. $c_k$ is the minimum price difference between two paths for the $k$th user under all
different flow distributions. Since the number of flow distributions is finite, $c_k$ surely exists. Then we can
define $c = \min_{k = 1 \sim K} c_k$. These parameters all depend on the network topology and can be obtained
beforehand. While choosing a $G$ based on (20) is doable, usually a smaller $G$ suffices. Here we are only concerned
with the existence of a sufficient condition.

The proof of Theorem 4 can be found in the Appendix. Here we only give its basic idea. If G
is chosen large enough, enough time is spent on exploration that we have relatively accurate sample
means for the cost of each edge under different flow amounts. Based on Bernstein's inequality, we can bound the
variance of the sample means of the path costs. When this variance is small enough, we can bound the probability
that we make mistakes in the virtual game circle. A mistake-free virtual game results in a Nash Equilibrium.
Although the proof of Theorem 4 seems lengthy, it relies on this simple idea.

VII. SIMULATIONS 

A. Price of Anarchy Simulation 

In this part we give the simulation result for the 'Price of Anarchy'. Figure 1 shows the probability density
function of the 'Price of Anarchy' for different cost function orders. The large density near price 1 indicates the
efficiency of our algorithm. The relationship between the 'Price of Anarchy' and the cost function order can also
be observed: the distribution for a higher order has a longer tail.

Fig. 1: 'Price of Anarchy' distribution



B. Regret Simulation 

In this part we give the simulation results for the regret order. Figure 2 shows how regret grows
with time under different choices of G. We choose the Gt as the basic G based on the condition shown in
Theorem 4. This condition is only a sufficient condition for logarithmic growth of regret. In the
actual simulation, we chose a basic G smaller than the one in Theorem 4, and the logarithmic growth still
holds.

The second figure shows the regret divided by log(T). It helps us see more clearly how the regret converges
to a logarithmic order. Moreover, the simulation shows that if G is not large enough, the regret grows with
an order larger than log(T). So in real-life applications, we should make sure that G is large enough.

VIII. Conclusions 

In this paper, we considered the flow scheduling problem under both known and unknown models. For the
known model, we proposed a virtual non-cooperative game with incentive pricing to solve the cost optimization
problem for users who do not exchange information with each other. To analyze this virtual game, we proved
the fast convergence of the game to a Nash Equilibrium Point with a bounded price of anarchy. The
constant bound was proved to be independent of network size and flow amount. Then we extended this algorithm
to situations where cost distributions are unknown beforehand. We modeled this problem as a multi-armed
bandit and combined the virtual game with the newly proposed DSEE Sequence, which achieves the
best regret for all light-tailed cost distributions. The regret of our algorithm was proved to grow
logarithmically with time if the DSEE parameters are chosen properly, which is the best order in the classic online
learning scenario. Simulation results for the 'Price of Anarchy' and the growth of regret were also given
to check our claims.

Appendix A 
Proof of Lemma 1 

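Based on (9) we have

$$[\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \cdot f_e = a_e f_e^d + \sum_{i=1}^{d} b_e^{(i)} f_e^{d-i} \tag{21}$$

Here $a_e$ and $b_e^{(i)}$ are coefficients. We do not require them to be nonnegative here, but in Theorem 3 we
require the coefficients to be nonnegative. Divide (21) by $f_e^d$ and we get

$$\frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} = a_e + \sum_{i=1}^{d} b_e^{(i)} f_e^{-i} \tag{22}$$

For any $\epsilon > 0$, there exists an $f_{e,\epsilon}$ such that for any $f_e > f_{e,\epsilon}$,

$$\left| \frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} - a_e \right| < \epsilon \tag{23}$$

So we have, for any $f_e > f_{e,\epsilon}$,

$$a_e - \epsilon < \frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} < a_e + \epsilon \tag{24}$$

Since $f_{e,\epsilon}$ is finite, there exists a closed interval $I_e$ such that for any $f_e \le f_{e,\epsilon}$,

$$\frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} \in I_e \tag{25}$$

Since $\frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} > 0$, $0 \notin I_e$. We choose $\epsilon < a_e$ and
denote $J_e = I_e \cup [a_e - \epsilon, a_e + \epsilon]$, and we have for any $f_e$,

$$\frac{\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)}{f_e^{d-1}} \in J_e \tag{26}$$

Similarly, we divide (9) by $f_e^d$ and finally get

$$\frac{\bar{c}_e(f_e)}{f_e^d} \in J'_e \tag{27}$$

Here $J_e$ and $J'_e$ are both closed intervals excluding zero. Then for any flow distribution $F$, we have

$$\frac{P(F)}{\bar{C}(F)} = \frac{\sum_{e \in E} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \cdot f_e}{\sum_{e \in E} \bar{c}_e(f_e)} = \frac{\sum_{e \in E} \{[\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] / f_e^{d-1}\} \cdot f_e^d}{\sum_{e \in E} \{\bar{c}_e(f_e) / f_e^d\} \cdot f_e^d} \tag{28}$$

From (26) and (27) we know there exist two numbers $A_l$, $A_r$ such that (10) holds for any flow distribution. ∎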
Appendix B
Proof of Lemma 2 

For a certain $k \in \{1, 2, \ldots, K\}$, the Nash Equilibrium Point $F_N$ (the point $F$ of Lemma 2) satisfies

$$\sum_{e \in p_k^N} [\bar{c}_e(f_e) - \bar{c}_e(f_e - 1)] \le \sum_{p_k \in \Gamma_k} \sum_{e \in p_k} [\bar{c}_e(f_e + 1) - \bar{c}_e(f_e)] \cdot f_{p_k} \tag{29}$$

Here $p_k^N$ is the path of the $k$th user at the Nash Equilibrium Point and $\Gamma_k$ represents the set of all
paths available to the $k$th user. $f_{p_k} = 1$ only when the path $p_k \in \Gamma_k$ is chosen by the $k$th user;
otherwise it equals zero. (29) follows directly from the definition of a Nash Equilibrium Point. For different path
selection schemes, $f_{p_k}$ varies, but (29) always holds. For a certain flow distribution $F'$, we add up (29) for
all $K$ users and get

$$P(F_N) \le \sum_{e \in E} [\bar{c}_e(f_e + 1) - \bar{c}_e(f_e)] \cdot f'_e \tag{30}$$

Since we already have (11), we get (12). ∎

Appendix C 
Proof of Theorem 2 

Let F represents a random Nash Equilibrium point and F* denotes the optimum point. For a certain edge e, 
from (26), we have 

[Ce{fe)-Ce{fe-l)]/f d e - 1 ^ ^ 



< ~Ta\ (3D 



[Utt)-Ce{tt-l)]l{f*e) d - 1 ~ jf> 

The J ( e L) and 4 R) are left and right border of J e . And * represents the optimum point. From this inequality 



we can derive directly and get 



r(L) 
•Je 



[Ce(fe) ~ Cede ~ 1)] ■ ft <(-^)M[^(/e) " Ufe ~ 1)] ' fe}~ 

jr> (32) 



Based on the Ho Ider inequality, we get 



{[c e ( f * e ) - c e (f* e - 1)} ■ r e y. 

j(L) 



TV 

E[Ce(/e) " Ce(/e " 1)] ■ /* <(^y)^ ~ Ufe ~ 1)] ■ feV 

eeE Je eeE 

■ {[Ce(r e ) ~ C e (f* e - 1)] ■ f* e } L « 



=(^)- d . [P(F)\- ■ [P(F*)}« 



'e 



'e 
J 

Since (12) holds for every flow distribution F', we could let F' = F* so 

. J e - i r — , , d-i 



j(L) 

<(^)HJ2^(fe) ~ Cede - 1)] ■ fe}^ (33) 
Je eeE 

•{E^(/e)-Ce(/ e *-l)]-/e^ 

eeE 

Up 



It means 



P(F) < A u ■ (-5-)- • [P(F)\- ■ [P(F*)P (34) 

Je 

j(L) 

P(F)/P(F*)<(A u ) d -^ m (35) 

Je 
Together with Lemma 1, we finally get

$$\frac{C(F)}{C(F^*)} = \frac{C(F)}{P(F)} \cdot \frac{P(F)}{P(F^*)} \cdot \frac{P(F^*)}{C(F^*)} \le (A_u)^d \cdot \frac{A_r J_e^{(R)}}{A_l J_e^{(L)}} \qquad (36)$$

From the previous lemmas, the constants on the right side of this inequality are independent of network topology and flow distribution. Since $F$ is an arbitrary Nash Equilibrium point, we have proved Theorem 2. $\blacksquare$
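The Hölder step used in this proof bounds a sum of products with exponents $(d-1)/d$ and $1/d$ by the product of the corresponding sums. The following is a minimal numerical sanity check of that inequality on random positive vectors. The function name `holder_sides` and the test data are illustrative assumptions and are not part of the paper.

```python
import random

def holder_sides(x, y, d):
    """Return (lhs, rhs) of the Hoelder inequality
    sum_e x_e^((d-1)/d) * y_e^(1/d) <= (sum_e x_e)^((d-1)/d) * (sum_e y_e)^(1/d)
    for positive sequences x, y and an integer d >= 1."""
    a, b = (d - 1) / d, 1 / d          # exponents sum to one
    lhs = sum(xe ** a * ye ** b for xe, ye in zip(x, y))
    rhs = sum(x) ** a * sum(y) ** b
    return lhs, rhs

rng = random.Random(0)                  # fixed seed so the check is repeatable
for d in (2, 3, 5):
    # x_e plays the role of [C_e(f_e) - C_e(f_e - 1)] * f_e, y_e the same at F*
    x = [rng.uniform(0.1, 10.0) for _ in range(20)]
    y = [rng.uniform(0.1, 10.0) for _ in range(20)]
    lhs, rhs = holder_sides(x, y, d)
    assert lhs <= rhs + 1e-9
```

Equality holds only when the two vectors are proportional, which is why the bound is generally loose for a Nash point that differs from the optimum.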

Appendix D 
Proof of Theorem 3 

The conditions in this theorem also ensure that the cost functions are convex, so for any flow distribution we have

$$C(F) \le P(F) \qquad (37)$$

From (15) we have, for any $i$ and any $e \in E$,

af < L (38) 

Based on the Hölder inequality, we have

+ 1) - c e (/ e )] -/;=EE ^ft'- 1 /: 

e£E i=0 eg-B 

i=0 e£E 



■{E^^/e)^}^ (39) 

i=0 e£E e e&E e 

d 

<Lmax — -^{CiF)}^ 1 {C(F*)}^ 



i=0 



Since $C(F^*) \le C(F)$, we have

$$\sum_{e \in E} [C_e(f_e + 1) - C_e(f_e)] \cdot f_e^* \le (d + 1) L \max_{e \in E} \frac{1}{S_e} \cdot \{C(F)\}^{\frac{d-1}{d}} \{C(F^*)\}^{\frac{1}{d}} \qquad (40)$$



For an arbitrary Nash equilibrium $F$ and the optimum point $F^*$, from (30) and (37) we have

$$C(F) \le P(F) \le \sum_{e \in E} [C_e(f_e + 1) - C_e(f_e)] \cdot f_e^* \qquad (41)$$

Combining (16), (17), (40) and (41), we have

$$\frac{C(F)}{C(F^*)} \le \left[(d + 1) L \max_{e \in E} \frac{1}{S_e}\right]^d = d^{O(d)} \qquad (42)$$

This constant is independent of network topology and flow distribution. $\blacksquare$
Appendix E
Proof of Theorem 4 

Since the number of time slots used for exploration and for Bellman-Ford grows as $O(\log t)$, we only need to bound the number of slots in which the flows are not operating at the Nash equilibrium point. Define $A_t$ as the event that the flows are not operating at the Nash equilibrium point at time $t$. We now derive an upper bound on $P(A_t)$.

Define $B_t^k$ as the event that the last Bellman-Ford run before time slot $t$ for the $k$th flow goes wrong because of poor estimation of the path cost. Then

$$P(B_t^k) = P\{\hat{X}^*(t) > \min_{p \in P} \hat{X}_p(t)\} \qquad (43)$$

Here $P$ denotes the set of paths that the $k$th flow can choose from, and $\hat{X}_p(t)$ is the estimated incentive price for choosing path $p$. This price is calculated by adding up all the extra edge cost introduced by the $k$th flow, that is,

$$\hat{X}_p(t) = \sum_{e \in p} [\hat{C}_e(f_e) - \hat{C}_e(f_e - f_k)] \qquad (44)$$

Here $p^*$ represents the truly best path for the $k$th flow, i.e. the one it would choose if the expected price of each edge were known exactly, and $\hat{X}^*(t)$ is the estimated price for choosing this path.
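As a minimal sketch of the price in (44), assume the estimated edge costs are available through a callable `est_cost(e, f)` and that the current edge loads already include the $f_k$ units of the $k$th flow. The names `incentive_price`, `est_cost`, `flows` and the quadratic toy cost are hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, Hashable, Iterable

Edge = Hashable

def incentive_price(path: Iterable[Edge],
                    flows: Dict[Edge, int],
                    est_cost: Callable[[Edge, int], float],
                    f_k: int = 1) -> float:
    """Sum, over the edges of `path`, of the extra cost caused by the k-th flow:
    est_cost(e, f_e) - est_cost(e, f_e - f_k).

    Assumption: flows[e] already counts the f_k units of the k-th flow."""
    return sum(est_cost(e, flows[e]) - est_cost(e, flows[e] - f_k) for e in path)

# Toy example with a hypothetical quadratic edge cost c_e(f) = a_e * f^2.
coeff = {"ab": 1.0, "bc": 2.0}

def quad(e, f):
    return coeff[e] * f ** 2

flows = {"ab": 3, "bc": 1}
price = incentive_price(["ab", "bc"], flows, quad)
# edge "ab": 1*(9 - 4) = 5, edge "bc": 2*(1 - 0) = 2, so price == 7.0
```

Each flow would evaluate this price for its candidate paths and pick the cheapest one, which is the comparison that event $B_t^k$ in (43) refers to.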

Let $n_e(k,t)$ be the number of times edge $e \in E$ is observed carrying $k$ units of flow during the exploration slots up to time $t$. Let $r_e(k)$ represent the probability that $e$ with flow $k$ on it is chosen to be observed at a random time slot. Since $k$ can only take values from $\{1, 2, \ldots, K\}$ and the number of edges is finite, the minimum $r = \min_{e \in E} r_e$ exists.

Clearly,

$$E(n_e(k,t)) = G r_e(k) \log t \qquad (45)$$
$$\mathrm{Var}(n_e(k,t)) \le G r_e(k) \log t \qquad (46)$$

so, based on Bernstein's inequality 

P{n e (k,t) < ^Grlogt} <P{n e (k,t) < -Gr e {k)logt} 

<exv{ A gMM) ) (47) 

V 2lE(n e (k,t)) + Var(n e (k,t)) J 

=t -\Gr e {k) < t -l 

Let M = ^Grlogt and we can easily get 

$$P\{\exists e \in E, k \in \{1, 2, \ldots, K\} \text{ s.t. } n_e(k,t) < M\} \le \sum_{e \in E,\, 1 \le k \le K} P\{n_e(k,t) < M\} \le K|E|t^{-1} \qquad (48)$$

We choose a barycentric spanner in the network and assume it has $d_k$ elements $\{p_1, p_2, \ldots, p_{d_k}\}$; then

{X*(t) > min peP X p (t)} C{X*(t) - X*(t) > p U d ^ {*,(*) " M*) < ~} (49) 

in which

$$\hat{X}_i(t) = \sum_{e \in p_i} [\hat{C}_e(f_e + f_k) - \hat{C}_e(f_e)] \qquad (50)$$

and $X^*(t)$ represents the true minimum expected price of a path for the $k$th flow. Specifically, for each $p_i$ we have

$$\hat{X}_i(t) - X_i(t) = \sum_{e \in p_i} [\hat{C}_e(f_e + f_k) - \hat{C}_e(f_e)] - \sum_{e \in p_i} [C_e(f_e + f_k) - C_e(f_e)] \qquad (51)$$

Once each edge has been observed enough times, this difference is small with high probability. Let $L_i$ denote the number of edges in $p_i$. Then we have

P{Xi{t) - X t (t) < -7^-|Ve G E, k G {1, 2, K}, n e (k, t) > M} 

<P{\Mt) - X t (t)\ > -£-|Ve £E,k£ {l,2,...,K},n e (k,t) > M} 
2d k 



< ^P{|c e (/ e )-C e (/ e )|> 



ee Pl 2dkLl 



|Ve G E,k £ {l,2,...,K},n e (k,t) > M} 

+ J2 P{\Ce(fe + fk) ~ C e (f e + fk)\ > 



2d k L l (52) 



<2L ; * 2exp(-- 



|Ve G E,k £ {l,2,...,K},n e (k,t) > M} 

c 

2d k L, 



1 UcLL, ) 2 - 



<4|£7|exp(--— ^— ) 
<4| J B|t~ 1 

A similar upper bound for $X^*(t)$ can also be obtained. After that we get

P{B*) < A{\E\ + l^l 2 )*- 1 + r l < 5|£| V 1 (53) 

Each event $B_t^k$ leads to the event $A_{t'}$ for some $t' > t$. To make all $K$ flows reach the Nash Equilibrium point, we should ensure that $B$ does not happen for a long enough period before time $t$. In fact, if $B$ does not happen, the number of Bellman-Ford cycles needed to complete the virtual game is the one given by Theorem 1, because in that case the situation is the same as if the routers knew the cost distribution of each edge exactly.

The structure of the DSEE sequence places the starting points of the exploration periods on an exponentially growing sequence. We present this fact heuristically. For the start time $t_1$ of an exploration period, we have

$$\mathrm{card}(t_1) = G \log t_1 \qquad (54)$$

and for the starting point $t_2$ of the next exploration period we have

$$\mathrm{card}(t_2) = G \log t_2 \qquad (55)$$

Since $\mathrm{card}(t_1) + NK = \mathrm{card}(t_2)$, we have

$$\frac{t_2}{t_1} = \exp\left(\frac{NK}{G}\right) \qquad (56)$$
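The exponential spacing in (56) can be illustrated with a small simulation. This is only a sketch under an assumed scheduling rule that is not taken from the paper: an exploration period of `block` slots (standing in for $NK$) starts whenever fewer than $G \log t$ slots have been spent on exploration so far. The function name `exploration_starts` and the parameter values are hypothetical.

```python
import math

def exploration_starts(G, block, horizon):
    """Start times of exploration periods up to `horizon`.

    Assumed rule (illustrative): a period of `block` slots starts at slot t
    whenever the number of exploration slots used so far is below G * log(t)."""
    starts, explored, t = [], 0, 1
    while t <= horizon:
        if explored < G * math.log(t):
            starts.append(t)
            explored += block   # card(t) grows by one full period
            t += block          # the whole period is spent exploring
        else:
            t += 1              # exploitation slot
    return starts

starts = exploration_starts(G=10.0, block=5, horizon=100_000)
# Ratios t_2 / t_1 of the last few consecutive start times.
ratios = [b / a for a, b in zip(starts[-6:], starts[-5:])]
target = math.exp(5 / 10.0)     # exp(NK / G) with NK = 5, G = 10
```

For large $t$ the entries of `ratios` settle close to `target`, which is the behaviour (56) describes; the early periods deviate because $G \log t$ grows quickly near $t = 1$.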
Let {ti,t2, ...trs M -,\ denote the starting points of last \^-~\ circles of Bellman Ford period before time t. And 
let tps ii£ -| +1 denote the starting point of the following period after time t. We see obviously that 



exp( -) (57) 



h " v G 

For any Bellman Ford time slot t* between t\ and t^s M ^ 1 1 ,it satisfies that 

_ < — = ea;p( ^ ) (58) 

During these Bellman-Ford cycles, if $B$ does not happen, then $A_t$ does not happen either. So we have

p(A t )< p ( B ")< E ^ivr 1 

t*,k=l,2,...,K t*,fc=l,2,...,K 

_ S M NK\&l + 1] , 

In other words, the total regret up to time horizon $T$ can be written as

$$\sum_{t=1}^{T} P(A_t) = \sum_{t=1}^{T} O(t^{-1}) \qquad (60)$$

which is $O(\log T)$. $\blacksquare$
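The last step relies on the fact that a sum of terms of order $t^{-1}$ grows logarithmically. A short numerical check, assuming a hypothetical constant `c` in front of $t^{-1}$ (the paper does not give its value in this section), is:

```python
import math

def cumulative_bound(T, c=1.0):
    """Sum of c / t for t = 1..T, standing in for the sum of P(A_t) in (60)."""
    return sum(c / t for t in range(1, T + 1))

for T in (10, 1_000, 100_000):
    total = cumulative_bound(T)
    # Harmonic sums satisfy log(T + 1) <= H_T <= 1 + log(T).
    assert math.log(T + 1) <= total <= 1 + math.log(T)
```

The two-sided bound shows that the accumulated probability of being away from the Nash equilibrium is of order $\log T$, neither faster nor slower, for any fixed constant `c`.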